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Abstract 

We reproduce the 1975 derivation of the alternating least squares algorithm for 
squared distance scaling, from an internal report that got lost in the folds of time. 
In addition, we present a derivation and a substantial speed improvement based on 
majorization. 
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1 Introduction 

Volume 73 of the Journal of Statistical Software is a festschrift for Jan de Leeuw (me). It 
contains a very nice article by Yoshio Takane, discussing our early interactions and some 
of my “lost papers” (Takane (2016)). One of these lost papers is De Leeuw (1975), which 
discusses the ELEGANT algorithm, an augmentation method for squared distance scaling. 
The algorithm has been discussed in the past by Takane (1977), Takane (1980), Browne 
(1987), De Leeuw, Groenen, and Pietersz (2004), and now again by Takane (2016), but with 
different derivations than those in the original paper. In this paper we reconstruct the original 
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derivation, slighty generalized to include weights. Note that De Leeuw, Groenen, and Pietersz 
(2004) also use weights, and also describe a close approximation of the original algorithm. 

Both Takane (1977) and Browne (1987) kncl the ELEGANT algorithm to be prohibitively 
slow to converge. After discussing the original algorithm and analyzing its convergence rate 
we use quadratic majorization to improve the convergence speed. The fact that the original 
ELEGANT algorithm can also be derived by using quadratic majorization was first observed 
by De Leeuw, Groenen, and Pietersz (2004), using a different derivation. 

2 The ELEGANT Algorithm 

The paper De Leeuw (1975) deals with the problem of minimizing the sstress loss function, 
defined on the space of n x p configurations as 

-inn 

°( X ) '■= 9 Z w iA e ij ~ e iA X )) 2 - (!) 

z i=1 j= 1 

Here the Wij are given non-negative weights and the e l3 are given non-negative dissimilarities. 
Both weights and dissimilarties are symmetric and hollow , i.e. have a zero diagonal. The 
6ij(X) are squared Euclidean distances , defined by 

eij{ X) \= (xi - Xj)\xi - xfi = [e-i - efifiXX '(a - efi) = tr X'AijX, 

where A l} := (e t — efi(e t — efi' and the e* are unit vectors with a 1 in position i and 0 
elsewhere. I am pretty sure the original formulation of ELEGANT did not include weights, 
but we include them here for the reasons simlar to those reviewed by Groenen and Van de 
Velclen (2016), section 6. The alternative derivations by Takane (1977) and Browne (1987) 
did not consider weights either, but De Leeuw, Groenen, and Pietersz (2004) did include 
weights. 

At that time we were trying to generalize the alternating least squares algorithms used for 
fitting non-metric linear and bilinear models to multidimensional scaling, i.e. to the least- 
squares fitting of distances and squared distances. One way to attack this problem is the 
approach chosen in ALSCAL (Takane, Young, and De Leeuw (1977)), where the problem of 
fitting the optimal configuration is attacked by block coordinate descent. The approach in 
De Leeuw (1975) is quite different, because it does not partition the variables into blocks, 
but instead uses augmentation (De Leeuw (1994)). Note that augmentation was applied to 
multidimensional scaling shortly before majorization became the preferred approach. In this 
context majorization methods are another way to extend alternating least squares. They 
were first announced in the proceedings of the 1975 US-Japan Seminar on Multidimensional 
Scaling and Related Techniques. 

To explain the original formulation we introduce, for each quadruple (i,j, k , /), 

eijki ■= {xi - x j )'(x k - xfi) = (e* - efi)'XX\e k - efi) = tr XX'(ei - efi){e k - efifi. 

Also define 
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^■ijkt • 


( e *j 

1 arbitrary 


if (i,j) = (k,£), 
otherwise, 


( 2 ) 


and w ijU := ^Jw^wu- 

We now define an augmented version of sstress, which is a function of both the configuration 
X and the disparities 


Y n n n n 

a t (x :,£):= 5 E LEE ^ijkii^ijkt t'ijki(X ')) , (3) 

z i =l j =l k =1 £=1 

The relation with sstress from (1) is 

<?{X) = min a*(X,E), 

EjEzC 

where £ is the set of disparities satisfying (2). The ELEGANT algorithm in De Leeuw (1975) 
uses alternating least squares to minimize <r*. Thus we alternate minimization over X keeping 
E fixed with minimization over E E £ keeping X fixed. 

Minimizing cr* over E E £ for fixed X is trivial. We simply set 


tijkl ■ 


( e *i 

I 6ijk£\Xj 


if (i,j) = (M), 
otherwise. 


( 4 ) 


The more complicated and interesting problem is to minimize cr* over X for fixed E. Observe 
first 

n n n n n n n n 

EEEE ^ij kt&ij kt EEEEvp(e, - e 3 )'XX\e k - e e ) = tr X'B(X)X, 

i=ij=ik=i£=i i=ij=ik=ie=i 

with 


b(x)-.= EL EE ^ijkt^-ijkti&i O) {^k &l) • 
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Next 

n n n n n n n n 1 1 

EEEE w ijkt^-ijkl — E E E E w; j wl,(e,-e 1 YXX'(e t -e c )(e k -e,YXX'(e,-e 1 ) = tr (VH) ! , 

i=lj=lfc=l^=l «=1 j = l fc=l £=1 

with 


V := 


it it 

E E w l A a- 


i= 1 j=l 


( 6 ) 
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It follows that the A" minimizing cr* for fixed E must satisfy the stationary equations 
B(X)X = VX(X'VX). If Z are the normalized eigenvectors corresponding with the p 
largest eigenvalues of the generalized eigen problem B(X)Z = VZ A, with Z’VZ = I, then 

x = Zhh. 

Working with four-dimensional arrays, and computing eijki(X) for all quadruples, is unattrac¬ 
tive. It can be avoided, however. Write (4) in the form 

€ijkt = S ik S j % + (1 - s ik 6 je )e ijki ( X) = A'w'b,, - 6ij ) + e ijkl (X) 

and substitute this into (5). After some simplification we find the value of B(X) at this 
to be 


B{ X) = R( X) + VXX'V, 


(7) 


with 


R(X) ; = E E »«(% - )A ii . (8) 

i=l j= 1 

Note that (6) says that Vij = —2 y/wEj for i ^ j, and the diagonal is filled in such that rows 
and columns add up to zero. In the same way (8) says r*j( X) = —Wij(e.ij — eij(X)) for i ^ j, 
and the diagonal elements again make rows and columns sum to zero. 

The algorithm is now simply to compute the update X by solving the generalized eigen 
problem defined by B(X k ) and V. Note that the fixed point equations for this algorithm 
(R(X) + VXX’V)X = VX(X'VX) are equivalent to R(X)X = 0, which are the stationary 
equations found by setting the derivatives of a equal to zero. It is easy to understand why, in 
my youthful enthusiasm, I baptized the algorithm ELEGANT. Note that if all off-diagonal 
weights are equal to one then VX = 2 nX, and thus B(X) = R(X) + 4n 2 A"A" / and we must 
compute eigenvalues and eigenvectors of A"A"' + 


3 Ekman example 

datafekman, package="smacof ") 

lbs<-attr (ekman, "Labels") 

ekman <- (1 - as.matrix (ekman)) ~ 2 

ha <- elegant (ekman, verbose = FALSE, itmax = 5000) 

The algorithm takes 3498 iterations to come up with sstress 3.3187849896 and the following 
solution. 
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Figure 1: Ekman Unweighted 

1 

Now consider using the weights Wij = (2e^) _1 . We have the approximation 


a(X) 


1 7i n 1 inn i 
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Thus using these weights approximates an unweighted least squares loss function defined on 
the distances, not the squared distances. 

n <- nrow (ekman) 

w <- ifelse (ekman == 0, 0, 1 / (2 * sqrt (ekman))) 

hb <- elegant (ekman, w = w, verbose = FALSE, itmax = 5000) 


After 306 iterations we find the following solution. 
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Figure 2: Ekman Weighted 

4 ELEGANT Majorization 

The ELEGANT algorithm uses augmentation. Suppose we want to minimize a function 
/ : X —» M, and the problem is too difficult or cumbersome to be attacked directly. It helps 
if we can find an augmented function g : X ® Y —> 1R. such that f(x ) = min yg y g(x, y) for all 
x G X and the subproblems of minimizing g over igl for fixed y G Y and minimizing over 
y EY for fixed x € X are both easy. We can then alternate the two subproblems to update 
iterate (x^ k \y^) by 


y{k+l) _ ar g m j n g(x^ k \ y), 
y&Y 

^(fc+l) _ ar g m j n |/ ( ' fc+1 )), 
xex 

and under weak conditions accumulation points of the sequence will be stationary points 
of / (see De Leeuw (1994)). In ELEGANT the augmentation variables y are the off-diagonal 
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elements of the four-dimensional array £ = {e^}. 

Majorization algorithms (nowadays often called the MM algorithms , see Lange (2016)) are 
augmentation algorithms that use a majorization function. Again / : X —> M, but now 
g : X ® X —» M satisfies g(x,y ) > /(x) for all x,y G X and g(x,x) = /(x) for all x G A". 
This implies /(x) = min ye xg(x,y) and x® = argmin yey g(x^ k \ y). Thus the majorization 
algorithm is simply 


x (k+1) = argming(x,x^). 

x&X 

One particular form of majorization we will use in this paper is the quadratic approximation 
method of Vosz and Eckhardt (1980) or Bohning and Lindsay (1988), also known as the 
quadratic upper bound method (Lange (2016), section 4.6). If / is twice differentiable and 
V 2 f(x) < B with B positive definite for all x G X then 

g(x, y) = f(y) + (x- y)'Vf(y ) + ^(x - y)'B{x - y) 
is a majorization function. The majorization algorithm is 

x (fc+ i) =x (fc) -r 1 D/(x w ). 

In order to get a quadratic upper bound for sstress we prove a useful inequality. 

Lemma 1: [Inequality] 


EE® Aj) < EE 4^ HEE 4 a « 

i=lj=l \i=l j=l J \i=lj=l 

Proof: Suppose X and Y are an arbitrary n x p matrices, and define C = XY' and 
c = vec(C'). Also let e ijke (X,Y) = (x* - Xj)'(y k - ye) and e i:j (X,Y) = (x* - x j ) , (y i - yf). 

Then 


j = V <g> V. 


E E w a e ij 

i= 1 3 =1 


(X, Y) = y L w ‘i tr AfjC'AtjC - V y w,j c'( Uj ® A tj ) c, 


(9) 


i =1 J=1 


*=ii=i 


and 


E E E E y ) = EEEE 


i=i j =i fc=i £=i 


*=i j=i /c=i ^=i 


tr EC'VC' 


Thus 


c ' EE (Aij (8) Aj) > c < c'(V <E> V)c 


i =i j =i 


c'(I/®P)c. 

( 10 ) 
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for all C of the form XY', which means for all C. QED 

The inequality in lemma 1 is now used to obtain a quadratic majorization of sstress. Think 
of sstress as a function of c = vec(C'), with C = XX'. Then 


T> a(c) — EE w v(Aij ® A-ij), (11) 

i=l j 1 

and thus for all pairs of square symmetric C and C 

a(C) < a(C ) - tr R(C)(C - C) + hr V(C - C)V(C - C). 

Differentiating the majorization function with repect to X and setting the derivative equal 
to zero gives 

(R{C) + VCV)X = VX(X'VX), 

which is the update formula for the ELEGANT algorithm. 

5 Beyond ELEGANT 

In the unweighted case we replace the current configuration X with the configuration giving 
the best rank p approximation of XX' + ^oR(X). This leads to slow convergence, because 
the inertia part A^A"' is weighted much more heavily than the change part R(X). Both De 
Leeuw (1975) and Browne (1987) suggest using a step size procedure than gives a larger 
weight to the change part. If the larger step size decreases the loss we go to the next iteration, 
if it does not decrease the loss we take a smaller step size and try again. This is somewhat 
ad hoc, but it works. 

The majorization derivation of ELEGANT, however, gives us a more rigorous way to speed 
up the algorithm. The expression (11) for the second derivatives of stress makes it possible 
to easily derive bounds of the form V 2 a(C) < f3I for some scalar /3, where < is used for the 
Loewner order of square symmetric matrices, with A < B if B — A is positive semiclehnite. 
The algorithm then computes best rank p approximations to A"A"' + ^R(X). For ELEGANT 
with unit weights /3 = 4 n 2 . 

For /3 we can make two choices. The first is the largest eigenvalue of the Hessian, the matrix 
on the right hand side of (11). 

Lemma 2: [Hessian] If Wij = 1 for all i ^ j then the largest eigenvalue of the Hessian is 
4n. 

Proof: Define the matrix S with elements = tr A^A^. S has the same eigenvalues 
as the Hessian from (11). All elements of S are non-negative, each row has An — 8 elements 
equal to +1 and two elements equal to +4. The other elements are zero. Thus the largest- 
(Frobenius) eigenvalue, corresponding to an eigenvector with all elements equal to +1, is An. 

QED 



Thus in the case of equal unit weights we can use (3 = 4 n instead of 4n 2 , which suggests the 
majorization algorithm based on the largest eigenvalue will be about n times as fast. It does 
require initially solving the eigenvalue problem for a large matrix of order n 2 , however. If we 
choose j3 as the trace of the Hessian we find j3 = 4 Y%= i I2j=% w ij , which becomes 4 n(n — 1) 
in the case of unit weights. This (almost) coincides with the original ELEGANT algorithm 
for the unweighted case. We have programmed the majorization methods with a scalar /?, 
and the resulting simplications, in a separate R function beyond. 

We’ll try these alternative majorization methods on the Ekman data. First with the eigenvalue 
bound. 

data(ekman, package="smacof ") 
ekman <- (1 - as.matrix (ekman)) ~ 2 

he <- beyond(ekman, bound = "eval", verbose = FALSE, itmax = 5000) 

The beyond algorithm now takes 298 iterations (instead of 3498) to come up with sstress 
3.3187849627, the same solution as ELEGANT. About 12 times faster in iteration count. 
Individual iterations are also faster, but there is some overhead because of the large initial 
eigenvalue problem. 

data(ekman, package="smacof ") 
ekman <- (1 - as.matrix (ekman)) ~ 2 

hd <- beyond(ekman, bound = "trace", verbose = FALSE, itmax = 5000) 

For the trace bound we need 3268 iterations to arrive at sstress sstress 3.3187849875, again 
the same solution as ELEGANT. There is no overhead to compute the eigenvalue bound, but 
the iteration gain is minimal (probably reflecting the difference between n 2 and n(n — 1)). 

Next we use microbenchmark (Mersmann (2015)) to compare computing times of ELEGANT 
and the two quadratic majorization bounds. 

## Unit: milliseconds 

## expr min 

## elegant(ekman, verbose = FALSE, itmax = 5000) 442.435321 

## beyond(ekman, verbose = FALSE, bound = "eval", itmax = 5000) 191.455628 
## beyond(ekman, verbose = FALSE, bound = "trace", itmax = 5000) 397.601953 
## lq mean median uq max neval cld 

## 465.593165 475.5283196 471.488237 481.8184840 582.078795 100 c 

## 208.286437 243.4682953 211.918767 308.3162115 326.199085 100 a 

## 419.952887 437.7884735 426.484254 438.1270405 546.422315 100 b 

We see a speedup factor of about 2.2 if we use the eigenvalue bound, which is 56, and about 
10 percent gain over ELEGANT if we use the trace bound, which is 4 x 13 x 14 = 728. The 
speed gain is considerably less than the gain in iteration count. For larger problems we expect 
a larger speed gain. Note our algorithms use the slanczos function from the mgev package 
(Wood (2016)) to compute one or a few dominant eigenvalues. 
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6 Appendix: Code 

suppressPackageStartupMessages(library (mgcv, quietly = TRUE)) 

e <- function (i, n) 

ifelse (i == l:n, 1, 0) 

a <- function (i, j, n) 

outer (e(i, n) - e(j, n), e(i, n) - e(j, n)) 

torgerson <- function (delta, p = 2) { 
doubleCenter <- function(x) { 
n <- dim(x) [1] 
m <- dim(x) [2] 
s <- sum(x) / (n * m) 
xr <- rowSums(x) / m 
xc <- colSums(x) / n 
return((x - outer(xr, xc, "+")) + s) 

> 

z <- slanczos(-doubleCenter (delta / 2), p) 
v <- pmax(z$values, 0) 
return(z$vectors %*% diag(sqrt (v))) 

} 

mpower <- function (x, p, eps = le-16) { 
z <- eigen (x, symmetric = TRUE) 
zval <- z$values[1: (nrow(x) - 1)] p 
zvec <- z$vectors[, l:(nrow(x) - 1)] 
return (zvec %*°/„ diag (zval) %*% t (zvec)) 

> 

elegant <- 

function (delta, 

w = 1 - diag (nrow (delta)), 

P = 2, 

xold = torgerson (delta, p), 
itmax = 1000, 
eps = le-10, 
debug = FALSE, 
verbose = TRUE) { 
itel <- 1 

v <- -2 * sqrt (w) 
diag (v) <- -rowSums (v) 
vinv <- mpower (v,-l / 2) 
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dold <- as.matrix (dist (xold)) ~ 2 
sold <- sum (w * (delta - dold) ~ 2) 
repeat { 

b <- -w * (delta - dold) 
diag (b) <- -rowSums (b) 
h <- tcrossprod (v %*% xold) 
s <- b + h 

z <- slanczos (vinv %*% s 7,*% vinv, p) 
u <- vinv */,*•/, z$vectors 
evz <- pmax(z$values, 0) 
xnew <- u */,*•/, diag(sqrt (evz)) 
dnew <- as.matrix (dist (xnew)) ~ 2 
snew <- sum (w * (delta - dnew) ~ 2) 
if (debug) 
browser () 
if (verbose) { 
cat ( 

formate (itel, width = 4, format = "d"), 
formate ( 
sold, 

digits = 10, 
width = 15, 
format = "f" 

), 

formate ( 
snew, 

digits = 10, 
width = 15, 
format = "f" 

), 

"\n" 

) 

} 

if ((itel == itmax) I I (sold - snew) < eps) 
break 

itel <- itel + 1 
xold <- xnew 
dold <- dnew 
sold <- snew 

> 

return (list ( 
x = xnew, 
d = dnew, 
itel = itel, 
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s = snew 

)) 

} 

beyond <- 

function (delta, 

w = 1 - diag (nrow (delta)), 

P = 2, 

xold = torgerson (delta, p), 
bound = "eval", 
itmax = 1000, 
eps = le-10, 
debug = FALSE, 
verbose = TRUE) { 
n <- nrow (delta) 
itel <- 1 

vv <- matrix (0, n ~ 2, n ~ 2) 
if (bound == "eval") { 
for (i in l:n) 
for (j in l:n) 

vv <- vv + w[i, j] * kronecker (a (i, j, n), a(i, j, n)) 
lbd <- slanczos(vv, l)$values 

> 

if (bound == "trace") 
lbd <- 4 * sum (w) 
dold <- as.matrix (dist (xold)) ~ 2 
sold <- sum (w * (delta - dold) ~ 2) 
repeat { 

b <- -w * (delta - dold) 
diag (b) <- -rowSums (b) 
h <- tcrossprod (xold) 
s <- b / lbd + h 
z <- slanczos (s, p) 
u <- z$vectors 
evz <- pmax(z$values, 0) 
xnew <- u ft*’/, diag(sqrt (evz)) 
dnew <- as.matrix (dist (xnew)) ~ 2 
snew <- sum (w * (delta - dnew) ~ 2) 
if (debug) 
browser () 
if (verbose) { 
cat ( 

formate (itel, width = 4, format = "d"), 
formate ( 
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sold, 

digits = 10, 
width = 15, 
format = "f" 

), 

formate ( 
snew, 

digits = 10, 
width = 15, 
format = "f" 

), 

"\n" 

) 

} 

if ((itel == itmax) I I (sold - snew) < eps) 
break 

itel <- itel + 1 
xold <- xnew 
dold <- dnew 
sold <- snew 

> 

return (list ( 
x = xnew, 
d = dnew, 
itel = itel, 
bound = lbd, 
s = snew 

)) 

} 
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